From Raw to Technically Correct Data

(from de Jonge van der Loo)

A data set is a collection of data that describes attribute values (variables) of a number of real-world objects (units). By technically correct data we mean a data set where each value:

  1. can be directly recognized as belonging to a certain variable, and
  2. is stored in a data type that represents the value domain of the real-world variable.

This means that, for each unit, a text variable should be stored as text, a numeric variable as a number, and so on, all in a format that is consistent across the data set, with appropriate variable (column) names. I will not cover the case of categorical variables, which would require the use of labels; a minimal illustration follows, and you have further reading about this in the following link: http://pandas.pydata.org/pandas-docs/stable/categorical.html
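
For reference only, this is roughly what such a labelled (categorical) variable looks like:


In [ ]:
import pandas as pd

# a categorical variable stored with labels rather than plain strings
pd.Series(["male", "female", "male"], dtype="category")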

The objective is to:

  • load data in a DataFrame with suitable column and index names
  • have each column of the DataFrame be of a type that adequately represents the value domain of the variable in the column
  • not lose data (or lose as little as possible)

Note that:

  • we don't check if the data makes sense

Reading text data to DataFrame

We have already seen in previous notes that Pandas provides functions that take a file as input and return a DataFrame as output. This is the most suitable way of reading text into Pandas; however, we can always use Python's csv library and convert the data to a DataFrame using any of the DataFrame constructors, as sketched below.
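
As a minimal sketch of that csv-module route (assuming the same people.csv file used later in this section):


In [ ]:
import csv
import pandas as pd

with open("../data/people.csv") as f:
    rows = list(csv.reader(f))          # each row becomes a list of strings

# first row as header, remaining rows as data; note every value is still a str
df_manual = pd.DataFrame(rows[1:], columns=rows[0])
df_manual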

We will cover some of the most common uses of Pandas regarding the csv file format reading libraries, for other file formats and further options, please refer to http://pandas.pydata.org/pandas-docs/stable/io.html.

CSV reading

Nowadays, the most common data format is CSV: tabular data files that use a separator to divide the values of the variables. There are two main kinds of CSV standards: comma-separated and semicolon-separated files.


In [ ]:
fname = "../data/people.csv"
with open(fname) as f:
    content = f.readlines()
print(content[:5])

Pandas provides the read_csv function, which takes a CSV file as input and returns a DataFrame.


In [1]:
import pandas as pd
from IPython.display import display, HTML

path_to_file = "../data/people.csv"
df = pd.read_csv(path_to_file)
display(df)
print("Types:")
display(df.dtypes)
print(df.describe())


Unnamed: 0 Age[years] Sex Weight[kg] Eye Color Body Temperature[C]
0 individuum 1 42 female 52.9 brown 36.9
1 individuum 2 37 male 87.0 green 36.3
2 individuum 3 29 male 82.1 blue 36.4
3 individuum 4 61 female 62.5 blue 36.7
4 individuum 5 77 female 55.5 gray 36.6
5 individuum 6 33 male 95.2 green 36.5
6 individuum 7 32 female 81.8 brown 37.0
7 individuum 8 45 male 78.9 brown 36.3
8 individuum 9 18 male 83.4 green 36.6
9 individuum 10 19 male 84.7 gray 36.1
Types:
Unnamed: 0              object
Age[years]               int64
Sex                     object
Weight[kg]             float64
Eye Color               object
Body Temperature[C]    float64
dtype: object
       Age[years]  Weight[kg]  Body Temperature[C]
count   10.000000   10.000000            10.000000
mean    39.300000   76.400000            36.540000
std     18.263808   14.274531             0.279682
min     18.000000   52.900000            36.100000
25%     29.750000   66.600000            36.325000
50%     35.000000   81.950000            36.550000
75%     44.250000   84.375000            36.675000
max     77.000000   95.200000            37.000000

In [2]:
df.columns


Out[2]:
Index(['Unnamed: 0', 'Age[years]', 'Sex', 'Weight[kg]', 'Eye Color',
       'Body Temperature[C]'],
      dtype='object')

Some things to note:

  • read_csv recognizes the first line as column names
  • it recognizes variable types
  • it automatically assigns a row index

Useful options of read_csv:

  • sep : str, defaults to ',' for read_csv(), \t for read_table()

    Delimiter to use. If sep is None, will try to automatically determine this. Separators longer than 1 character and different from '\s+' will be interpreted as regular expressions, will force use of the python parsing engine and will ignore quotes in the data. Regex example: '\r\t'.

  • header : int or list of ints, default 'infer'

    Row number(s) to use as the column names, and the start of the data. Default behavior is as if header=0 if no names passed, otherwise as if header=None. Explicitly pass header=0 to be able to replace existing names. The header can be a list of ints that specify row locations for a multi-index on the columns e.g. [0,1,3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example is skipped). Note that this parameter ignores commented lines and empty lines if skip_blank_lines=True, so header=0 denotes the first line of data rather than the first line of the file.

  • names : array-like, default None

    List of column names to use. If file contains no header row, then you should explicitly pass header=None.

  • index_col : int or sequence or False, default None

    Column to use as the row labels of the DataFrame. If a sequence is given, a MultiIndex is used. If you have a malformed file with delimiters at the end of each line, you might consider index_col=False to force pandas to not use the first column as the index (row names).

  • dtype : Type name or dict of column -> type, default None

    Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32} (unsupported with engine='python'). Use str or object to preserve and not interpret dtype.

  • skiprows : list-like or integer, default None

    Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.

  • skipfooter : int, default 0

Number of lines at bottom of file to skip (unsupported with engine='c').

  • nrows : int, default None

    Number of rows of file to read. Useful for reading pieces of large files.

  • na_values : str, list-like or dict, default None

    Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: '-1.#IND', '1.#QNAN', '1.#IND', '-1.#QNAN', '#N/A N/A', '#N/A', 'N/A', 'NA', '#NA', 'NULL', 'NaN', '-NaN', 'nan', '-nan', ''.
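
As a minimal sketch combining several of these options (the file name and its layout are hypothetical):


In [ ]:
# hypothetical semicolon-separated file: two junk lines at the top,
# "?" marking missing values, and ids that must stay strings
df_opts = pd.read_csv("../data/some_file.csv",     # hypothetical path
                      sep=";",
                      skiprows=2,
                      header=0,
                      na_values=["?"],
                      dtype={"id": str})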

XLS reading

Another common file format is XLS, which contains an Excel spreadsheet. In this case each spreadsheet cell becomes a DataFrame cell, each spreadsheet column a DataFrame column, and the same holds for rows.


In [3]:
import pandas as pd

country_df = pd.read_excel("../data/country_info_worldbank.xls",skiprows=[0,1,2],header=1)
country_df.head(8)


Out[3]:
Unnamed: 0 Unnamed: 1 Economy Code Unnamed: 4 Region Income group Lending category Other
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 1 NaN Afghanistan AFG NaN South Asia Low income IDA HIPC
2 2 NaN Albania ALB NaN Europe & Central Asia Upper middle income IBRD NaN
3 3 NaN Algeria DZA NaN Middle East & North Africa Upper middle income IBRD NaN
4 4 NaN American Samoa ASM NaN East Asia & Pacific Upper middle income .. NaN
5 5 NaN Andorra ADO NaN Europe & Central Asia High income .. NaN
6 6 NaN Angola AGO NaN Sub-Saharan Africa Upper middle income IBRD NaN
7 7 NaN Antigua and Barbuda ATG NaN Latin America & Caribbean High income IBRD NaN

Useful options of read_excel:

  • sheetname : string, int, mixed list of strings/ints, or None, default 0

    Strings are used for sheet names, Integers are used in zero-indexed sheet positions.

    Lists of strings/integers are used to request multiple sheets.

    Specify None to get all sheets.

    str|int -> DataFrame is returned. list|None -> Dict of DataFrames is returned, with keys representing sheets.

    Available Cases

      Defaults to 0 -> 1st sheet as a DataFrame
      1 -> 2nd sheet as a DataFrame
      "Sheet1" -> 1st sheet as a DataFrame
      [0, 1, "Sheet5"] -> 1st, 2nd & 5th sheet as a dictionary of DataFrames
      None -> All sheets as a dictionary of DataFrames
  • header : int, list of ints, default 0

    Row (0-indexed) to use for the column labels of the parsed DataFrame. If a list of integers is passed those row positions will be combined into a MultiIndex

  • skiprows : list-like

    Rows to skip at the beginning (0-indexed)

  • index_col : int, list of ints, default None

    Column (0-indexed) to use as the row labels of the DataFrame. Pass None if there is no such column. If a list is passed, those columns will be combined into a MultiIndex
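
For instance, a minimal sketch of sheet selection, using the same file as above (the parameter is called sheetname in the pandas version documented here; newer versions rename it to sheet_name):


In [ ]:
# first sheet only -> a single DataFrame
first = pd.read_excel("../data/country_info_worldbank.xls", sheetname=0)

# all sheets -> a dict of DataFrames keyed by sheet name
all_sheets = pd.read_excel("../data/country_info_worldbank.xls", sheetname=None)
list(all_sheets.keys())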

SQL tables

Using SQL databases as a data source is a very useful tool; however, it won't be covered in depth in these notes.

The pandas.io.sql module provides a collection of query wrappers to both facilitate data retrieval and to reduce dependency on DB-specific API.

Database abstraction is provided by SQLAlchemy if installed. In addition you will need a driver library for your database.

Here are some examples:

http://stackoverflow.com/questions/10065051/python-pandas-and-databases-like-mysql
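
As an illustrative sketch only, here is a toy in-memory SQLite database read with pd.read_sql (SQLite needs no extra driver):


In [ ]:
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")   # toy in-memory database
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.execute("INSERT INTO people VALUES ('Alice', 22), ('Bob', 37)")

pd.read_sql("SELECT * FROM people", conn)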

Unstructured data

In previous sessions we have seen how to create DataFrames using dicts and lists. This is still a valid option for loading text files as a DataFrame: open the file as a regular text file and parse the lines.

To convert the text file to a DataFrame we will follow 5 steps:

1: Read data

We can do this with the common Python approach for reading a text file.


In [17]:
import re
import pandas as pd

fname = "../data/unstructured_data.txt"
with open(fname) as f:
    content = f.readlines()

In [2]:
content


Out[2]:
['# data from experiment A\n',
 '/Name/Age/Time\n',
 'Alice,22,23:20\n',
 'BOB,,26:43\n',
 '#the following people has a penalization of 3minutes\n',
 'Charlie ,"22",24:09\n',
 ' David,,\n',
 '##############\n']

2: Select lines containing data

A very useful way to do this is by using regular expressions.


In [18]:
data = [l for l in content if re.match("^(?!#).+", l)]  # keep non-comment lines
header = data[0].strip("\n")
data = data[1:]
print([header] + data)


['/Name/Age/Time', 'Alice,22,23:20\n', 'BOB,,26:43\n', 'Charlie ,"22",24:09\n', ' David,,\n']

3: Split data into separate fields

This can be done by applying the string split function:


In [19]:
data_str = [header.split("/")[1:]]+[l.split(",") for l in data]
data_str


Out[19]:
[['Name', 'Age', 'Time'],
 ['Alice', '22', '23:20\n'],
 ['BOB', '', '26:43\n'],
 ['Charlie ', '"22"', '24:09\n'],
 [' David', '', '\n']]

4: Convert to DataFrame

To standardize rows we will load the structures that we already have into a DataFrame. Remember that the DataFrame list constructor needs all the lists to be of the same length.


In [20]:
df = pd.DataFrame(data_str[1:], columns=pd.Series(data_str[0]))
df


Out[20]:
Name Age Time
0 Alice 22 23:20\n
1 BOB 26:43\n
2 Charlie "22" 24:09\n
3 David \n

5: Normalize and coerce to correct types

We can use to_numeric. This function accepts the argument errors:

  • errors : {'ignore', 'raise', 'coerce'}, default 'raise'

      If 'raise', then invalid parsing will raise an exception
      If 'coerce', then invalid parsing will be set as NaN
      If 'ignore', then invalid parsing will return the input

In [6]:
pd.to_numeric(df["Age"],errors="ignore")


Out[6]:
0      22
1        
2    "22"
3        
Name: Age, dtype: object

Applying Operations in Pandas

Both DataFrames and Series have methods to apply functions over the elements in the structure. These operations take a function (built-in, user defined or lambda) and apply it to each value in the DataFrame or Series, row- or column-wise.

For DataFrames we have:

  • apply: applies the function to each column of the DataFrame (or to each row with axis=1)
  • applymap: applies the function to each cell of the DataFrame

In [5]:
import numpy as np


df2 = pd.DataFrame(np.random.randn(4, 3), 
                   columns=list('bde'), 
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])
print(df2)
f = lambda x: x.max() - x.min()
print(df2.apply(lambda x: x.max() - x.min(),axis=0))
print(df2.apply(f,axis=1))

g = lambda x: x**2
print(df2.applymap(g))


               b         d         e
Utah    1.307807 -0.619167 -0.696909
Ohio    0.165332  0.927382  0.783865
Texas   1.634361  1.155594  0.616263
Oregon  0.155251  0.508423  0.208182
b    1.479110
d    1.774761
e    1.480775
dtype: float64
Utah      2.004717
Ohio      0.762050
Texas     1.018098
Oregon    0.353172
dtype: float64
               b         d         e
Utah    1.710360  0.383368  0.485683
Ohio    0.027335  0.860038  0.614445
Texas   2.671137  1.335399  0.379780
Oregon  0.024103  0.258494  0.043340

In [6]:
np.random.randn(4, 3)


Out[6]:
array([[ 0.43963213, -0.34952959, -2.20163386],
       [-0.92092226,  1.5762455 , -1.6822622 ],
       [ 0.01835854,  0.3780393 ,  2.61338955],
       [ 1.49569518,  1.37980042, -1.54020785]])

In [14]:
def my_func(x):
    return x.mean()
df2.apply(my_func, axis=1)


Out[14]:
Utah     -0.002756
Ohio      0.625527
Texas     1.135406
Oregon    0.290619
dtype: float64

For Series we have:

  • apply: applies the function to each element of the Series
  • map: applies the function to each element of the Series; it also accepts a dict or a Series to map values by lookup

More about applymap, apply and map: http://stackoverflow.com/questions/19798153/difference-between-map-applymap-and-apply-methods-in-pandas


In [9]:
s = pd.Series(np.arange(5))
print(s)
print(s.map(lambda x: x**2))
print(s.apply(lambda x: x**2))


0    0
1    1
2    2
3    3
4    4
dtype: int32
0     0
1     1
2     4
3     9
4    16
dtype: int64
0     0
1     1
2     4
3     9
4    16
dtype: int64

String Manipulation

(from http://pandas.pydata.org/pandas-docs/stable/text.html)

Series and Index are equipped with a set of string processing methods that make it easy to operate on each element of the array.

These methods exclude missing/NA values automatically and follow (more or less) the same syntax as the built-in string object, under the str namespace. Let's use the DataFrame from the unstructured dataset:


In [21]:
df["Name"] = df["Name"].str.title()
print(df["Name"].str.len())
df


0    5
1    3
2    8
3    6
Name: Name, dtype: int64
Out[21]:
Name Age Time
0 Alice 22 23:20\n
1 Bob 26:43\n
2 Charlie "22" 24:09\n
3 David \n

In [22]:
df.columns


Out[22]:
Index(['Name', 'Age', 'Time'], dtype='object')

In [23]:
df.columns.str.lower()


Out[23]:
Index(['name', 'age', 'time'], dtype='object')

The string methods on Index are especially useful for cleaning up or transforming DataFrame columns. For instance, you may have columns with leading or trailing whitespace:


In [24]:
import numpy as np

df = pd.DataFrame(np.random.randn(3, 2), columns=[' Column A ', ' Column B '],index=range(3))
df


Out[24]:
Column A Column B
0 0.408358 1.109967
1 0.774101 0.639438
2 0.519939 -0.346797

Since df.columns is an Index object, we can use the .str accessor


In [25]:
print(df.columns.str.strip())

print(df.columns.str.lower())


Index(['Column A', 'Column B'], dtype='object')
Index([' column a ', ' column b '], dtype='object')

These string methods can then be used to clean up the columns as needed. Here we remove leading and trailing whitespace, lowercase all names, and replace any remaining whitespace with underscores:


In [26]:
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
df


Out[26]:
column_a column_b
0 0.408358 1.109967
1 0.774101 0.639438
2 0.519939 -0.346797

Some things to note:

  1. Notice that if variables can't be coerced to any other type, Pandas will keep them as str. It's important to know how to manipulate str.

  2. str operations can also be performed using apply, as the sketch below shows.
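
A minimal sketch of that equivalence (the values are made up); note that .str methods pass NaN through automatically, while a plain function given to apply must handle it itself:


In [ ]:
import numpy as np
import pandas as pd

names = pd.Series(['alice', 'BOB', np.nan])
print(names.str.upper())                                              # vectorized
print(names.apply(lambda x: x.upper() if isinstance(x, str) else x))  # same via apply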

Splitting and Replacing Strings

Methods like split return a Series of lists:


In [27]:
s2 = pd.Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h'])
s2.str.split('_')


Out[27]:
0    [a, b, c]
1    [c, d, e]
2          NaN
3    [f, g, h]
dtype: object

Elements in the split lists can be accessed using get or [] notation:


In [28]:
print(s2.str.split('_').str.get(1))
print(s2.str.split('_').str[1])


0      b
1      d
2    NaN
3      g
dtype: object
0      b
1      d
2    NaN
3      g
dtype: object

It is easy to expand this to return a DataFrame using the expand keyword:


In [29]:
s2.str.split('_', expand=True)


Out[29]:
0 1 2
0 a b c
1 c d e
2 NaN NaN NaN
3 f g h

It is also possible to limit the number of splits:


In [30]:
s2.str.split('_', expand=True, n=1)


Out[30]:
0 1
0 a b_c
1 c d_e
2 NaN NaN
3 f g_h

rsplit is similar to split except it works in the reverse direction, i.e., from the end of the string to the beginning of the string:


In [31]:
s2.str.rsplit('_', expand=True, n=1)


Out[31]:
0 1
0 a_b c
1 c_d e
2 NaN NaN
3 f_g h

Methods like replace and findall take regular expressions, too:


In [32]:
s3 = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca','', np.nan, 'CABA', 'dog', 'cat'])
s3


Out[32]:
0       A
1       B
2       C
3    Aaba
4    Baca
5        
6     NaN
7    CABA
8     dog
9     cat
dtype: object

In [33]:
s3.str.replace('^.a|dog', 'XX-XX ', case=False)


Out[33]:
0           A
1           B
2           C
3    XX-XX ba
4    XX-XX ca
5            
6         NaN
7    XX-XX BA
8      XX-XX 
9     XX-XX t
dtype: object

Indexing with .str

We can directly access the characters of each string in a text variable using [] indexing on the str accessor. Positions that are out of bounds return NaN.


In [34]:
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan,'CABA', 'dog', 'cat'])
print(s.str[0])
print(s.str[1])


0      A
1      B
2      C
3      A
4      B
5    NaN
6      C
7      d
8      c
dtype: object
0    NaN
1    NaN
2    NaN
3      a
4      a
5    NaN
6      A
7      o
8      a
dtype: object

Extracting Substrings

Extract first match in each subject (extract)

The extract method accepts a regular expression with at least one capture group.

Extracting a regular expression with more than one group returns a DataFrame with one column per group.


In [35]:
pd.Series(['a1', 'b2', 'c3']).str.extract('([ab])(\d)', expand=False)


Out[35]:
0 1
0 a 1
1 b 2
2 NaN NaN

Setting the expand attribute to False returns a Series when possible; setting it to True always returns a DataFrame. If the regular expression has multiple groups, a DataFrame is returned in both cases.


In [36]:
pd.Series(['a1', 'b2', 'c3']).str.extract('([ab])', expand=False)


Out[36]:
0      a
1      b
2    NaN
dtype: object

In [37]:
pd.Series(['a1', 'b2', 'c3']).str.extract('([ab])', expand=True)


Out[37]:
0
0 a
1 b
2 NaN

Elements that do not match return a row filled with NaN. Thus, a Series of messy strings can be “converted” into a like-indexed Series or DataFrame of cleaned-up or more useful strings, without necessitating get() to access tuples or re.match objects. The dtype of the result is always object, even if no match is found and the result only contains NaN.

Named groups like


In [38]:
pd.Series(['a1', 'b2', 'c3']).str.extract('(?P<letter>[ab])(?P<digit>\d)', expand=False)


Out[38]:
letter digit
0 a 1
1 b 2
2 NaN NaN

and optional groups like


In [39]:
pd.Series(['a1', 'b2', '3']).str.extract('([ab])?(\d)', expand=False)


Out[39]:
0 1
0 a 1
1 b 2
2 NaN 3

Note that any capture group names in the regular expression will be used for column names; otherwise capture group numbers will be used.

To know more about expand attribute (e.g. when used with Index) refer to: http://pandas.pydata.org/pandas-docs/stable/text.html#extracting-substrings

Extract all matches in each subject (extractall)

The extractall method returns every match. The result of extractall is always a DataFrame with a MultiIndex on its rows. The last level of the MultiIndex is named match and indicates the order in the subject.


In [ ]:
s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"])
two_groups = '(?P<letter>[a-z])(?P<digit>[0-9])'
s.str.extract(two_groups, expand=True)

In [ ]:
s.str.extractall(two_groups)

How to access a MultiIndex?

You can think of a MultiIndex as a tupled index:


In [ ]:
mi_df = s.str.extractall(two_groups)
print(mi_df.columns)
print(mi_df.index)
print(mi_df.loc[("A",0)])

print(mi_df.T["A",0])

We can use the xs accessor with the level parameter to filter sub-index values:


In [ ]:
extract_result = s.str.extract(two_groups, expand=True)
extractall_result = s.str.extractall(two_groups)
print(extractall_result)
print(extractall_result.xs(0, level="match"))
print(extractall_result.xs(1, level="match"))

Testing for Strings that Match or Contain a Pattern

You can check whether elements contain a pattern:


In [ ]:
pattern = r'[a-z][0-9]'
pd.Series(['1', '2', '3a', '3b', '03c']).str.contains(pattern)

or match a pattern:


In [ ]:
pd.Series(['1', '2', '3a', '3b', '03c']).str.match(pattern)  # as_indexer is deprecated; match returns booleans by default

Methods like match, contains, startswith, and endswith take an extra na argument so missing values can be considered True or False:


In [ ]:
s4 = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s4.str.contains('A', na=False)
# note: the resulting boolean Series can be summed to count matches,
# e.g. s4.str.contains('A', na=False).sum()

When two DataFrames have indexes that nearly, but not exactly, match, string similarity can be used to align them before a join:


In [ ]:
df1 = pd.DataFrame([[1],[2],[3],[4],[5]], 
                   index=['one','two','three','four','five'], 
                   columns=['number'])
df1

In [ ]:
df2 = pd.DataFrame([['a'],['b'],['c'],['d'],['e']], 
                   index=['one','too','three','fours','five'], 
                   columns=['letter'])
df2

In [ ]:
import difflib

# map each index label of df2 to its closest match among df1's labels
df2.index = df2.index.map(lambda x: difflib.get_close_matches(x, df1.index)[0])

df1.join(df2)

Type Conversion

As already stated, when it is not possible to convert (coerce) values to a given data type, Pandas falls back to str. We have already worked with strings and we know how to apply functions element- and column-wise, so it won't be difficult to convert types ourselves.

Pandas provide some functions to convert types:

  • to_numeric(arg[, errors]): Convert argument to a numeric type.
  • to_datetime(*args, **kwargs): Convert argument to datetime.
  • to_timedelta(*args, **kwargs): Convert argument to timedelta.

to_numeric

In [41]:
s = pd.Series(["10","0.9","56","87'6","34'89"])

In [44]:
# we can raise, ignore or coerce
pd.to_numeric(s,errors="coerce")


Out[44]:
0    10.0
1     0.9
2    56.0
3     NaN
4     NaN
dtype: float64

In [51]:
pd.to_numeric(s.str.replace("'","."),errors="coerce")


Out[51]:
0    10.00
1     0.90
2    56.00
3    87.60
4    34.89
dtype: float64

So we can use str.replace to solve this problem. Be careful when applying .str methods over a Series: it has to contain only str objects, otherwise NaN is returned for the non-string elements.


In [ ]:
pd.to_numeric(s.str.replace("'","."))

In [53]:
df.columns


Out[53]:
Index(['column_a', 'column_b'], dtype='object')

If we recover the unstructured DataFrame, we can now convert it to the correct data types. (Note: the KeyError below occurs because df was later overwritten by the whitespace-columns example; re-create the unstructured DataFrame from step 4 before running this cell.)


In [52]:
df["Name"] = df["Name"].str.title().str.strip()
df["Age"] = pd.to_numeric(df["Age"].str.replace("\"",""))
df


---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   2524             try:
-> 2525                 return self._engine.get_loc(key)
   2526             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'Name'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-52-0fe7148a1af1> in <module>()
----> 1 df["Name"] = df["Name"].str.title().str.strip()
      2 df["Age"] = pd.to_numeric(df["Age"].str.replace("\"",""))
      3 df

~\Anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
   2137             return self._getitem_multilevel(key)
   2138         else:
-> 2139             return self._getitem_column(key)
   2140 
   2141     def _getitem_column(self, key):

~\Anaconda3\lib\site-packages\pandas\core\frame.py in _getitem_column(self, key)
   2144         # get column
   2145         if self.columns.is_unique:
-> 2146             return self._get_item_cache(key)
   2147 
   2148         # duplicate columns & possible reduce dimensionality

~\Anaconda3\lib\site-packages\pandas\core\generic.py in _get_item_cache(self, item)
   1840         res = cache.get(item)
   1841         if res is None:
-> 1842             values = self._data.get(item)
   1843             res = self._box_item_values(item, values)
   1844             cache[item] = res

~\Anaconda3\lib\site-packages\pandas\core\internals.py in get(self, item, fastpath)
   3841 
   3842             if not isna(item):
-> 3843                 loc = self.items.get_loc(item)
   3844             else:
   3845                 indexer = np.arange(len(self.items))[isna(self.items)]

~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   2525                 return self._engine.get_loc(key)
   2526             except KeyError:
-> 2527                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2528 
   2529         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'Name'

to_timedelta

Timedeltas are differences in times, expressed in different units, e.g. days, hours, minutes, seconds. They can be both positive and negative.


In [ ]:
#this does not work
#pd.to_timedelta(df["Time"].str.strip().map(lambda x: "00:"+x),errors="coerce")

#workaround
def str_convert(x):
    return pd.NaT if len(x)==0 else "00:"+x

pd.to_timedelta(df["Time"].str.strip().map(str_convert),box=True)
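
to_datetime works analogously; here is a minimal sketch with made-up date strings:


In [ ]:
# invalid entries become NaT with errors="coerce"
dates = pd.Series(["2016-01-15", "2016-02-20", "not a date"])
pd.to_datetime(dates, errors="coerce")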

Technically Correct Data with Pandas

  1. Data is stored in a DataFrame with suitable column and row indexes
  2. Each column of the DataFrame has the Pandas type that adequately represents the value domain of the variable in the column
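
A quick way to check both points on any DataFrame (here df stands for whatever DataFrame you loaded):


In [ ]:
print(df.columns)   # suitable column names?
print(df.index)     # suitable row index?
print(df.dtypes)    # does each dtype match the variable's value domain?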

Exercises

Exercise: Read iqsize.csv using read_csv. Use na_values to assign missing data.

Exercise: Read country_info_worldbank.xls using read_excel. Use skiprows to get rid of empty lines.

Exercise: Try to load papers.lst by reading it line by line, taking into account that the fields are the following:

  • year
  • unknown
  • conference
  • authors (linked with &)
  • paper title

Exercise: Repeat the previous exercise taking advantage of the str functions.

Exercise: Load books.csv. Extract Author or Unknown. Then work with the title: erase irrelevant data and try to extract as many authors and dates as possible.

Exercise: Convert to Technically Correct Data: iqsize.csv.

Exercise: Convert to Technically Correct Data: country_info_worldbank.xls.